Introduction

The “Avengers” dataset, published by FiveThirtyEight, is a collection of data on characters who have been members of the Avengers in Marvel comics. It contains information such as each character's gender, number of comic book appearances, whether they are a current member, the year they joined the team, their membership status (for example, full or honorary), and records of their deaths and returns.

In this project, I will explore different statistical techniques for analyzing the “Avengers” dataset. Specifically, I will examine the distribution of a numerical variable, demonstrate the applicability of the Central Limit Theorem using random samples, and investigate various sampling methods that can be used on the dataset. I will also draw conclusions about the strengths and limitations of different sampling methods, and discuss the implications of these results for future analyses of the “Avengers” dataset.

Required Libraries

library(readr)    # read_csv()
library(dplyr)    # %>%, sample_n(), group_by(), summarise(), glimpse()
library(plotly)   # plot_ly()

Read the csv file

The data comes from the following URL: https://raw.githubusercontent.com/fivethirtyeight/data/master/avengers/avengers.csv

avengers <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/avengers/avengers.csv")

Analysis of Categorical and Numerical Variables

For this analysis, I will look at the distribution of the Gender variable, which is categorical, and the Appearances variable, which is numerical. I will create a bar chart to visualize the distribution of Gender, and a histogram to visualize the distribution of Appearances.

library(ggplot2)

# Bar chart of Gender distribution
ggplot(avengers, aes(x = Gender)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Gender Distribution of Avengers", x = "Gender", y = "Count")

# Histogram of Appearances distribution
ggplot(avengers, aes(x = Appearances)) +
  geom_histogram(fill = "steelblue", binwidth = 100) +
  labs(title = "Distribution of Appearances", x = "Appearances", y = "Count")

The bar chart shows that there are more male Avengers than female, with a ratio of about 2:1 (115 male and 58 female characters). The histogram shows that the distribution of Appearances is skewed to the right, with a long tail indicating that there are a few Avengers who have appeared in a very large number of issues.

Analysis of Two Variables

For this analysis, I will look at the relationship between the Year and Appearances variables. I will create a scatter plot to visualize this relationship.

# Scatter plot of Year and Appearances
ggplot(avengers, aes(x = Year, y = Appearances)) +
  geom_point(color = "steelblue") +
  labs(title = "Relationship Between Year and Appearances", x = "Year", y = "Appearances")

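The scatter plot shows the relationship between Year, the year in which a character joined the Avengers, and Appearances, with a few outliers who have appeared in a very large number of issues. The highest counts belong to the earliest members from 1963 to 1965 (for example, 3,458 and 3,068 appearances in the rows printed later in this report), so the plot does not support the idea that characters who joined later appear in more issues; if anything, earlier members have had more time to accumulate appearances. A few rows also have Year 1900, which looks like a placeholder rather than a real joining year and stretches the x-axis. This relationship may also be confounded by other factors, such as changes in the comic book industry or the popularity of the Avengers franchise.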
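
One way to put a number on the direction of this relationship is a rank correlation. The sketch below is illustrative only: it uses the Year and Appearances values from rows printed later in this report (the systematic sample, plus two rows with Year 1900) rather than the full dataset, and `toy`, `valid`, and `rho` are names chosen for this example. It drops the Year 1900 rows, on the assumption that they are placeholders, and computes Spearman's rho.

```r
# Year/Appearances pairs copied from sample rows printed in this report
toy <- data.frame(
  Year = c(1963, 1965, 1975, 1979, 1986, 2003, 1989, 2007, 1993,
           2005, 2005, 2010, 2010, 1900, 1900),
  Appearances = c(1269, 115, 197, 935, 2305, 217, 36, 575, 28,
                  886, 132, 359, 31, 108, 2)
)

# Assumption: Year == 1900 is a placeholder, not a real joining year
valid <- toy[toy$Year > 1900, ]

# Spearman's rho is rank-based, so a few very large counts do not dominate
rho <- cor(valid$Year, valid$Appearances, method = "spearman")
print(round(rho, 2))
```

A negative value would mean that later joiners tend to have fewer appearances. On the full data, the same two steps could be applied to avengers$Year and avengers$Appearances.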

Distribution of One Numerical Variable

In this part, I choose the “Appearances” variable, which represents the number of comic book issues in which the character appeared, and visualize its distribution using a histogram.

ggplot(avengers, aes(x=Appearances)) +
  geom_histogram(binwidth=100) +
  xlab("Number of Appearances") +
  ylab("Count")
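
To complement the histogram with numbers, one can compare the mean with the median and compute a moment-based skewness. The sketch below is a minimal illustration under stated assumptions: `appearances` is a simulated right-skewed stand-in for avengers$Appearances (log-normal counts with arbitrary parameters), not the real data.

```r
# Hypothetical stand-in for avengers$Appearances: simulated right-skewed counts
set.seed(1)
appearances <- round(rlnorm(173, meanlog = 5, sdlog = 1.3))

# In a right-skewed variable the long tail pulls the mean above the median
skew_summary <- c(mean = mean(appearances), median = median(appearances))

# Moment-based sample skewness: positive values indicate a longer right tail
skewness <- mean((appearances - mean(appearances))^3) / sd(appearances)^3

print(round(skew_summary, 1))
print(round(skewness, 2))
```

Replacing the simulated vector with avengers$Appearances would give the corresponding figures for the real data.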

The distribution is heavily skewed to the right, with a long tail of characters who appeared in many comic book issues.

Draw various random samples of the data and show the applicability of the Central Limit Theorem for this variable.

I again use the “Appearances” variable. I draw 1,000 random samples of size 30 with replacement, compute the mean of each sample, and plot the distribution of the sample means using a histogram:

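set.seed(100)  # make the resampling reproducible
n_samples <- 1000  # number of samples to draw
sample_size <- 30  # sample size
sample_means <- numeric(n_samples)  # pre-allocated vector to store sample means

for (i in seq_len(n_samples)) {
  # draw with replacement; named `draws` to avoid masking the sample() function
  draws <- sample(avengers$Appearances, size = sample_size, replace = TRUE)
  sample_means[i] <- mean(draws)
}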
ggplot(data.frame(sample_means), aes(x=sample_means)) +
  geom_histogram(binwidth=10) +
  xlab("Sample Mean") +
  ylab("Count")
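
Alongside the histogram, the Central Limit Theorem can be checked numerically: the sample means should centre on the population mean, and their standard deviation should be close to sigma / sqrt(n). The sketch below is an illustration under stated assumptions: `population` is a simulated right-skewed stand-in for avengers$Appearances, and `predicted_se` and `observed_se` are names chosen for this example.

```r
set.seed(42)
# Hypothetical stand-in for avengers$Appearances (simulated, right-skewed)
population <- round(rlnorm(173, meanlog = 5, sdlog = 1.3))

n_samples <- 1000
sample_size <- 30

# Mean of each of 1000 samples of size 30, drawn with replacement
sample_means <- replicate(
  n_samples,
  mean(sample(population, size = sample_size, replace = TRUE))
)

# CLT prediction for the spread of the sample means: sigma / sqrt(n),
# using the population standard deviation (divisor N, not N - 1)
pop_sd <- sqrt(mean((population - mean(population))^2))
predicted_se <- pop_sd / sqrt(sample_size)
observed_se <- sd(sample_means)

print(c(population_mean = mean(population),
        mean_of_sample_means = mean(sample_means)))
print(c(predicted_se = predicted_se, observed_se = observed_se))
```

If the two standard errors are close and the two means agree, the simulation is behaving as the theorem predicts, even when the histogram still shows some right skew.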

I find that the distribution of the sample means is approximately normal, even though the original distribution of the “Appearances” variable was heavily skewed. This demonstrates the applicability of the Central Limit Theorem for this variable. With a sample size of 30 and such a heavy right tail, some residual right skew in the sample means would not be surprising.

Show how various sampling methods can be used on your data. What are your conclusions if these samples are used instead of the whole dataset?

# Simple random sample
set.seed(100)
srs <- avengers %>% sample_n(50)
# Stratified random sample
stratified <- avengers %>% group_by(Gender) %>% 
              sample_n(size = 10) %>% ungroup()
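# "Cluster" sample: here simply the first 10 rows (the earliest members), not a random selection of clusters
cluster <- avengers %>% slice(1:10)
# Systematic sample: every 10th row starting at row 1, stopping at row 121 of 173
systematic <- avengers[seq(1, 121, by = 10), ]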
# Convenience sample
convenience <- avengers %>% filter(Appearances >= 100)
srs
## # A tibble: 50 × 21
##    URL      Name/…¹ Appea…² Curre…³ Gender Proba…⁴ Full/…⁵  Year Years…⁶ Honor…⁷
##    <chr>    <chr>     <dbl> <chr>   <chr>  <chr>   <chr>   <dbl>   <dbl> <chr>  
##  1 http://… "Alias…     121 NO      MALE   <NA>    6-Feb    2006       9 Full   
##  2 http://… "Robbi…     299 NO      MALE   <NA>    10-Jun   2010       5 Full   
##  3 http://… "Marcu…      65 YES     MALE   <NA>    13-Feb   2013       2 Full   
##  4 http://… "Rober…    2089 YES     MALE   <NA>    Sep-63   1963      52 Full   
##  5 http://… "Rita …      68 NO      FEMALE <NA>    Nov-88   1988      27 Honora…
##  6 http://…  <NA>        16 NO      FEMALE <NA>    5-Jul    2005      10 Full   
##  7 http://… "Willi…     123 YES     MALE   <NA>    5-Apr    2005      10 Full   
##  8 http://… "Anya …     108 YES     FEMALE <NA>    <NA>     1900     115 Academy
##  9 http://… "Steve…    3458 YES     MALE   <NA>    Mar-64   1964      51 Full   
## 10 http://… "Marc …     402 NO      MALE   Sep-87  Jun-88   1988      27 Full   
## # … with 40 more rows, 11 more variables: Death1 <chr>, Return1 <chr>,
## #   Death2 <chr>, Return2 <chr>, Death3 <chr>, Return3 <chr>, Death4 <chr>,
## #   Return4 <chr>, Death5 <chr>, Return5 <chr>, Notes <chr>, and abbreviated
## #   variable names ¹​`Name/Alias`, ²​Appearances, ³​`Current?`,
## #   ⁴​`Probationary Introl`, ⁵​`Full/Reserve Avengers Intro`,
## #   ⁶​`Years since joining`, ⁷​Honorary
stratified
## # A tibble: 20 × 21
##    URL      Name/…¹ Appea…² Curre…³ Gender Proba…⁴ Full/…⁵  Year Years…⁶ Honor…⁷
##    <chr>    <chr>     <dbl> <chr>   <chr>  <chr>   <chr>   <dbl>   <dbl> <chr>  
##  1 http://… Fiona         2 YES     FEMALE <NA>    <NA>     1900     115 Academy
##  2 http://… Bonita…      83 NO      FEMALE <NA>    Sep-87   1987      28 Full   
##  3 http://… Ava Ay…      49 YES     FEMALE <NA>    14-Jan   2014       1 Full   
##  4 http://… Monica…     348 YES     FEMALE Jan-83  May-83   1983      32 Full   
##  5 http://… Jessic…     205 YES     FEMALE <NA>    10-Aug   2010       5 Full   
##  6 http://… <NA>         28 NO      FEMALE <NA>    Jun-93   1993      22 Honora…
##  7 http://… Monica…      12 YES     FEMALE <NA>    13-Sep   2013       2 Full   
##  8 http://… Sharon…     333 NO      FEMALE <NA>    10-May   2010       5 Full   
##  9 http://… Circe       237 NO      FEMALE <NA>    Feb-90   1990      25 Full   
## 10 http://… Americ…      22 YES     FEMALE <NA>    13-Jul   2013       2 Full   
## 11 http://… Jacque…     115 NO      MALE   <NA>    Sep-65   1965      50 Full   
## 12 http://… Nichol…      77 YES     MALE   <NA>    13-Apr   2013       2 Full   
## 13 http://… Philli…      31 NO      MALE   <NA>    Dec-92   1992      23 Honora…
## 14 http://… Eric O…      88 NO      MALE   <NA>    10-May   2010       5 Full   
## 15 http://… Nathan…      23 NO      MALE   <NA>    5-Apr    2005      10 Full   
## 16 http://… Scott …     217 NO      MALE   Jan-87  3-Feb    2003      12 Full   
## 17 http://… Delroy…     101 NO      MALE   <NA>    Apr-00   2000      15 Full   
## 18 http://… James …     533 NO      MALE   May-84  Sep-84   1984      31 Full   
## 19 http://… Loki L…      77 NO      MALE   <NA>    13-Jul   2013       2 Full   
## 20 http://… Wade W…     575 NO      MALE   <NA>    7-Sep    2007       8 Full   
## # … with 11 more variables: Death1 <chr>, Return1 <chr>, Death2 <chr>,
## #   Return2 <chr>, Death3 <chr>, Return3 <chr>, Death4 <chr>, Return4 <chr>,
## #   Death5 <chr>, Return5 <chr>, Notes <chr>, and abbreviated variable names
## #   ¹​`Name/Alias`, ²​Appearances, ³​`Current?`, ⁴​`Probationary Introl`,
## #   ⁵​`Full/Reserve Avengers Intro`, ⁶​`Years since joining`, ⁷​Honorary
cluster
## # A tibble: 10 × 21
##    URL      Name/…¹ Appea…² Curre…³ Gender Proba…⁴ Full/…⁵  Year Years…⁶ Honor…⁷
##    <chr>    <chr>     <dbl> <chr>   <chr>  <chr>   <chr>   <dbl>   <dbl> <chr>  
##  1 http://… "Henry…    1269 YES     MALE   <NA>    Sep-63   1963      52 Full   
##  2 http://… "Janet…    1165 YES     FEMALE <NA>    Sep-63   1963      52 Full   
##  3 http://… "Antho…    3068 YES     MALE   <NA>    Sep-63   1963      52 Full   
##  4 http://… "Rober…    2089 YES     MALE   <NA>    Sep-63   1963      52 Full   
##  5 http://… "Thor …    2402 YES     MALE   <NA>    Sep-63   1963      52 Full   
##  6 http://… "Richa…     612 YES     MALE   <NA>    Sep-63   1963      52 Honora…
##  7 http://… "Steve…    3458 YES     MALE   <NA>    Mar-64   1964      51 Full   
##  8 http://… "Clint…    1456 YES     MALE   <NA>    May-65   1965      50 Full   
##  9 http://… "Pietr…     769 YES     MALE   <NA>    May-65   1965      50 Full   
## 10 http://… "Wanda…    1214 YES     FEMALE <NA>    May-65   1965      50 Full   
## # … with 11 more variables: Death1 <chr>, Return1 <chr>, Death2 <chr>,
## #   Return2 <chr>, Death3 <chr>, Return3 <chr>, Death4 <chr>, Return4 <chr>,
## #   Death5 <chr>, Return5 <chr>, Notes <chr>, and abbreviated variable names
## #   ¹​`Name/Alias`, ²​Appearances, ³​`Current?`, ⁴​`Probationary Introl`,
## #   ⁵​`Full/Reserve Avengers Intro`, ⁶​`Years since joining`, ⁷​Honorary
systematic
## # A tibble: 13 × 21
##    URL      Name/…¹ Appea…² Curre…³ Gender Proba…⁴ Full/…⁵  Year Years…⁶ Honor…⁷
##    <chr>    <chr>     <dbl> <chr>   <chr>  <chr>   <chr>   <dbl>   <dbl> <chr>  
##  1 http://… "Henry…    1269 YES     MALE   <NA>    Sep-63   1963      52 Full   
##  2 http://… "Jacqu…     115 NO      MALE   <NA>    Sep-65   1965      50 Full   
##  3 http://… "Matth…     197 NO      MALE   <NA>    Aug-75   1975      40 Full   
##  4 http://… "Carol…     935 YES     FEMALE <NA>    Apr-79   1979      36 Full   
##  5 http://… "Benja…    2305 NO      MALE   <NA>    Jun-86   1986      29 Full   
##  6 http://… "Scott…     217 NO      MALE   Jan-87  3-Feb    2003      12 Full   
##  7 http://… "Ashle…      36 YES     FEMALE <NA>    Jul-89   1989      26 Full   
##  8 http://… "Wade …     575 NO      MALE   <NA>    7-Sep    2007       8 Full   
##  9 http://…  <NA>        28 NO      FEMALE <NA>    Jun-93   1993      22 Honora…
## 10 http://… "Carl …     886 YES     MALE   <NA>    5-Mar    2005      10 Full   
## 11 http://… "Kathe…     132 YES     FEMALE <NA>    5-Jun    2005      10 Full   
## 12 http://… "Maria…     359 YES     FEMALE <NA>    10-May   2010       5 Full   
## 13 http://… "John …      31 YES     MALE   <NA>    10-Dec   2010       5 Full   
## # … with 11 more variables: Death1 <chr>, Return1 <chr>, Death2 <chr>,
## #   Return2 <chr>, Death3 <chr>, Return3 <chr>, Death4 <chr>, Return4 <chr>,
## #   Death5 <chr>, Return5 <chr>, Notes <chr>, and abbreviated variable names
## #   ¹​`Name/Alias`, ²​Appearances, ³​`Current?`, ⁴​`Probationary Introl`,
## #   ⁵​`Full/Reserve Avengers Intro`, ⁶​`Years since joining`, ⁷​Honorary
convenience
## # A tibble: 105 × 21
##    URL      Name/…¹ Appea…² Curre…³ Gender Proba…⁴ Full/…⁵  Year Years…⁶ Honor…⁷
##    <chr>    <chr>     <dbl> <chr>   <chr>  <chr>   <chr>   <dbl>   <dbl> <chr>  
##  1 http://… "Henry…    1269 YES     MALE   <NA>    Sep-63   1963      52 Full   
##  2 http://… "Janet…    1165 YES     FEMALE <NA>    Sep-63   1963      52 Full   
##  3 http://… "Antho…    3068 YES     MALE   <NA>    Sep-63   1963      52 Full   
##  4 http://… "Rober…    2089 YES     MALE   <NA>    Sep-63   1963      52 Full   
##  5 http://… "Thor …    2402 YES     MALE   <NA>    Sep-63   1963      52 Full   
##  6 http://… "Richa…     612 YES     MALE   <NA>    Sep-63   1963      52 Honora…
##  7 http://… "Steve…    3458 YES     MALE   <NA>    Mar-64   1964      51 Full   
##  8 http://… "Clint…    1456 YES     MALE   <NA>    May-65   1965      50 Full   
##  9 http://… "Pietr…     769 YES     MALE   <NA>    May-65   1965      50 Full   
## 10 http://… "Wanda…    1214 YES     FEMALE <NA>    May-65   1965      50 Full   
## # … with 95 more rows, 11 more variables: Death1 <chr>, Return1 <chr>,
## #   Death2 <chr>, Return2 <chr>, Death3 <chr>, Return3 <chr>, Death4 <chr>,
## #   Return4 <chr>, Death5 <chr>, Return5 <chr>, Notes <chr>, and abbreviated
## #   variable names ¹​`Name/Alias`, ²​Appearances, ³​`Current?`,
## #   ⁴​`Probationary Introl`, ⁵​`Full/Reserve Avengers Intro`,
## #   ⁶​`Years since joining`, ⁷​Honorary

I have demonstrated five different sampling methods: simple random sampling, stratified random sampling, cluster sampling, systematic sampling, and convenience sampling.

A simple random sample involves selecting a random subset of the observations from the population, with every observation equally likely to be chosen. In this case, I have randomly selected 50 characters from the “Avengers” dataset. How representative this sample is depends on chance: with only 50 of 173 rows and a heavily skewed variable, a single sample may or may not include the few characters with very large appearance counts.

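Stratified random sampling involves dividing the population into subgroups (strata) and selecting a random sample from each subgroup. In this case, I have stratified the dataset by the “Gender” variable and selected 10 characters from each subgroup. This can be a useful sampling method if there are important subgroups in the population that need to be represented in the sample. Note that equal allocation overrepresents female characters here (10 of 20 in the sample versus 58 of 173 in the dataset), so overall estimates from this sample would need to be weighted by stratum size.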
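
An alternative to taking the same number from each stratum is proportional allocation, where each stratum's share of the sample matches its share of the population. The sketch below is a minimal base R illustration, not the method used above: it rebuilds only the Gender column from the counts reported in this document (58 female, 115 male), and `total_n`, `alloc`, and `picked` are names chosen for this example.

```r
set.seed(100)
# Gender column rebuilt from the counts reported in this document
gender <- rep(c("FEMALE", "MALE"), times = c(58, 115))
total_n <- 20  # overall sample size (assumption for this example)

# Row indices belonging to each stratum
rows_by_stratum <- split(seq_along(gender), gender)
stratum_sizes <- vapply(rows_by_stratum, length, integer(1))

# Proportional allocation; in general rounding may not sum exactly to total_n
alloc <- round(total_n * stratum_sizes / sum(stratum_sizes))

# Draw row indices without replacement inside each stratum
picked <- unlist(Map(function(rows, k) sample(rows, k), rows_by_stratum, alloc))

print(alloc)
print(table(gender[picked]))
```

With these counts the allocation is 7 female and 13 male characters, which mirrors the roughly 1:2 split in the dataset.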

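Cluster sampling involves dividing the population into clusters and selecting a random sample of clusters to include in the study. In this case, I have simply taken the first 10 characters in the dataset as a stand-in for a cluster. Because the first rows are the earliest members (1963 to 1965), all with 612 or more appearances, this is not a random selection of clusters and is far from representative. Cluster sampling can be a useful method if the population is geographically or otherwise naturally clustered.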
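
A cluster sample in the textbook sense picks whole clusters at random and then keeps every unit inside them. The sketch below illustrates that idea on made-up data: `toy` is a hypothetical data frame with simulated joining years, and treating the joining decade as the cluster is an assumption made only for this example.

```r
set.seed(7)
# Hypothetical data: 173 characters with simulated joining years
toy <- data.frame(id = 1:173, Year = sample(1963:2015, 173, replace = TRUE))

# Assumption: clusters are joining decades (any natural grouping could be used)
toy$decade <- floor(toy$Year / 10) * 10

# Stage 1: randomly choose whole clusters
chosen_decades <- sample(unique(toy$decade), size = 2)

# Stage 2: keep every row that belongs to a chosen cluster
cluster_sample <- toy[toy$decade %in% chosen_decades, ]

print(chosen_decades)
print(nrow(cluster_sample))
```

Unlike taking the first 10 rows, the randomness here sits in the choice of clusters, so every decade has the same chance of being included.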

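Systematic sampling involves selecting every nth observation from the population. In this case, I have selected every 10th character from the “Avengers” dataset, starting at row 1 and stopping at row 121, so the last 52 rows are never sampled. This can be a useful sampling method if the population is listed in an order that is unrelated to the variable of interest; if the order follows a trend or a repeating pattern, the sample can be biased.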
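
A common variant of systematic sampling uses a random starting row and continues to the end of the data. The sketch below is a minimal illustration: 173 is the row count reported by glimpse() in this document, while `step`, `start`, and `systematic_rows` are names chosen for this example.

```r
set.seed(3)
n_rows <- 173  # number of rows reported by glimpse() in this document
step <- 10     # sampling interval: every 10th row

# Random start between 1 and step, so every row has a chance of selection
start <- sample(seq_len(step), 1)

# Row indices start, start + 10, ... up to the last row of the data
systematic_rows <- seq(from = start, to = n_rows, by = step)

print(systematic_rows)
```

The resulting indices could then be used as avengers[systematic_rows, ] to extract the sample.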

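Convenience sampling involves selecting the most readily available observations. In this case, I have selected all characters with 100 or more comic book appearances, which gives 105 of the 173 characters. This is generally not a representative sampling method, as it is subject to bias based on what is convenient to the researcher; here it excludes every low-appearance character by construction, so it overstates the typical number of appearances.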
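
To see what using a sample instead of the whole dataset does to an estimate, one can compare the mean from each sample with the population mean. The sketch below is illustrative only: `population` is a simulated right-skewed stand-in for avengers$Appearances, so the numbers it prints are not results for the real data.

```r
set.seed(100)
# Hypothetical stand-in for avengers$Appearances (simulated, right-skewed)
population <- round(rlnorm(173, meanlog = 5, sdlog = 1.3))

# Simple random sample of 50 values, without replacement
srs <- sample(population, 50)

# Same filter as the convenience sample above: keep only values >= 100
convenience <- population[population >= 100]

estimates <- c(
  population = mean(population),
  srs = mean(srs),
  convenience = mean(convenience)
)
print(round(estimates, 1))
```

The filtered mean is always above the population mean, because the filter removes only values below the threshold. The simple random sample can miss in either direction.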

In conclusion, the choice of sampling method depends on the research question and the characteristics of the population. While some sampling methods can be useful for certain types of populations or research questions, others can introduce bias or inadequately capture the variation in the population. It is important to carefully consider the sampling method and its potential limitations before drawing conclusions based on a sample.

Use Data wrangling techniques for the appropriate analysis of your data.

avengers_filtered <- avengers %>% filter(!is.na(Gender))

# summarize number of Avengers by gender
avengers_summary <- avengers_filtered %>% group_by(Gender) %>% summarise(count = n())

# view the summary data
avengers_summary
## # A tibble: 2 × 2
##   Gender count
##   <chr>  <int>
## 1 FEMALE    58
## 2 MALE     115
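
The counts above can be turned into percentage shares, which makes the gender split easier to read. The sketch below is a small base R illustration that rebuilds the Gender column from the two counts in the summary table rather than reading the dataset; `counts` and `shares` are names chosen for this example.

```r
# Gender column rebuilt from the summary counts above
gender <- rep(c("FEMALE", "MALE"), times = c(58, 115))

# Frequency table, then each group's share of the total in percent
counts <- table(gender)
shares <- round(100 * prop.table(counts), 1)

print(shares)
```

This gives 33.5% female and 66.5% male, a ratio of roughly 1:2.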

Use plotly for your plots for interactivity

# explore the data set
glimpse(avengers)
## Rows: 173
## Columns: 21
## $ URL                           <chr> "http://marvel.wikia.com/Henry_Pym_(Eart…
## $ `Name/Alias`                  <chr> "Henry Jonathan \"Hank\" Pym", "Janet va…
## $ Appearances                   <dbl> 1269, 1165, 3068, 2089, 2402, 612, 3458,…
## $ `Current?`                    <chr> "YES", "YES", "YES", "YES", "YES", "YES"…
## $ Gender                        <chr> "MALE", "FEMALE", "MALE", "MALE", "MALE"…
## $ `Probationary Introl`         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ `Full/Reserve Avengers Intro` <chr> "Sep-63", "Sep-63", "Sep-63", "Sep-63", …
## $ Year                          <dbl> 1963, 1963, 1963, 1963, 1963, 1963, 1964…
## $ `Years since joining`         <dbl> 52, 52, 52, 52, 52, 52, 51, 50, 50, 50, …
## $ Honorary                      <chr> "Full", "Full", "Full", "Full", "Full", …
## $ Death1                        <chr> "YES", "YES", "YES", "YES", "YES", "NO",…
## $ Return1                       <chr> "NO", "YES", "YES", "YES", "YES", NA, "Y…
## $ Death2                        <chr> NA, NA, NA, NA, "YES", NA, NA, "YES", NA…
## $ Return2                       <chr> NA, NA, NA, NA, "NO", NA, NA, "YES", NA,…
## $ Death3                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Return3                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Death4                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Return4                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Death5                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Return5                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Notes                         <chr> "Merged with Ultron in Rage of Ultron Vo…
# create an interactive scatter plot of total appearances by joining year
# (sum(Appearances) replaces n(), which counted characters rather than appearances)
avengers %>%
  group_by(Year) %>%
  summarise(appearances = sum(Appearances, na.rm = TRUE)) %>%
  plot_ly(x = ~Year, y = ~appearances, color = ~appearances,
          type = "scatter", mode = "markers", marker = list(size = 8)) %>%
  layout(xaxis = list(title = "Year"), yaxis = list(title = "Number of Appearances"))

Conclusion

In this project, I explored different statistical techniques for analyzing the “Avengers” dataset. I started by examining the distribution of a numerical variable, which allowed me to understand the central tendency and spread of the data. I then demonstrated the applicability of the Central Limit Theorem by drawing many random samples from the dataset and showing that the means of these samples are approximately normally distributed around the population mean, even though the population itself is heavily skewed. This holds for sufficiently large samples from a population with finite variance.

Next, I investigated various sampling methods that can be used on the “Avengers” dataset, including simple random sampling, stratified random sampling, cluster sampling, systematic sampling, and convenience sampling. I found that different sampling methods can have varying degrees of representativeness and bias, depending on the structure and characteristics of the dataset. Therefore, it is important to carefully consider the sampling method used in any analysis to ensure the validity and generalizability of the results.

In conclusion, this project demonstrated some of the key statistical concepts and techniques that can be used to analyze the “Avengers” dataset, which can be applied to other datasets as well. By using appropriate statistical methods, I can gain insights and make informed decisions based on data-driven evidence, ultimately improving my understanding of complex phenomena and informing effective strategies for action.